Prompting the data transformation activities for cluster analysis on collections of documents
Abstract
In this work we argue for a new self-learning engine that suggests to the analyst suitable transformation methods and weighting schemas for a given data collection. This new generation of systems, named SELF-DATA (SELF-learning DAta TrAnsformation), relies on an engine that explores different data weighting schemas (e.g., normalized term frequencies, logarithmic entropy) and data transformation methods (e.g., PCA, LSI) before applying a given data mining algorithm (e.g., cluster analysis), evaluates and compares the resulting solutions through different quality indices (e.g., weighted Silhouette), and presents the top three solutions to the analyst. SELF-DATA will also include a knowledge database storing the results of experiments on previously processed datasets, and a classification algorithm trained on the knowledge base content to forecast the best methods for future analyses. SELF-DATA's current implementation runs on Apache Spark, a state-of-the-art distributed computing framework. The preliminary validation, performed on 4 collections of documents, highlights that the TF-IDF and logarithmic entropy weighting methods are effective at measuring item relevance in sparse datasets, and that the LSI method outperforms PCA in the presence of a larger feature domain.
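The two weighting schemas named in the abstract can be illustrated with a minimal, single-machine sketch. This is not SELF-DATA's Spark implementation: the function names (`term_counts`, `tf_idf`, `log_entropy`) are hypothetical, tokenization is plain whitespace splitting, and the formulas are the common textbook definitions (normalized term frequency times log inverse document frequency, and log local weight times an entropy-based global weight), which are assumed here rather than taken from the paper.

```python
import math
from collections import Counter


def term_counts(docs):
    """Count whitespace-separated, lower-cased terms in each document."""
    return [Counter(doc.lower().split()) for doc in docs]


def tf_idf(counts):
    """Weight = (term count / document length) * log(n_docs / document frequency)."""
    n = len(counts)
    df = Counter()
    for c in counts:
        df.update(c.keys())  # each term counted once per document
    weighted = []
    for c in counts:
        total = sum(c.values())
        weighted.append(
            {t: (f / total) * math.log(n / df[t]) for t, f in c.items()}
        )
    return weighted


def log_entropy(counts):
    """Weight = log(1 + count) * (1 + sum_j p_j * log(p_j) / log(n_docs)).

    p_j is the share of a term's total occurrences that falls in document j.
    Assumes at least two documents, so that log(n_docs) is non-zero.
    """
    n = len(counts)
    gf = Counter()
    for c in counts:
        gf.update(c)  # global frequency of each term across the collection
    entropy = {t: 0.0 for t in gf}
    for c in counts:
        for t, f in c.items():
            p = f / gf[t]
            entropy[t] += p * math.log(p)
    global_w = {t: 1.0 + entropy[t] / math.log(n) for t in gf}
    return [
        {t: math.log(1 + f) * global_w[t] for t, f in c.items()}
        for c in counts
    ]


# Hypothetical toy collection, for illustration only.
docs = ["spark cluster spark", "cluster analysis", "cluster text"]
counts = term_counts(docs)
tfidf_weights = tf_idf(counts)
logent_weights = log_entropy(counts)
```

In this toy collection the term "cluster" occurs once in every document, so both schemas give it a weight of (approximately) zero, while "spark", which is concentrated in a single document, keeps a positive weight. That is the sense in which such schemas measure item relevance in sparse data.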
Similar resources
Protection of Archival Documents from Photochemical Effects
Purpose: The purpose of this paper is to highlight the destructive effects of light on archival documents/paper materials. The research aims to explain the mechanism of photochemical degradation and the damaging effect of light on paper. It also tells us about the measures to be adopted to control the deteriorating effects of light on paper step by step. Design/Methodology/Approach: The res...
The Package for Mental and Social Health Promotion and Drug Abuse Prevention in the Health Transformation Plan: Executive Leadership Challenges and Suggestions
Background and Aim: The “Package for mental and social health promotion and drug abuse prevention” was developed in response to the importance of, and concerns related to, mental and social health in the population. Since any policy and plan needs to be assessed to find its weaknesses, strengths and challenges to ensure its successful implementation, this study aimed to find and explain the ...
Performance Evaluation of Cluster Based Algorithm used for Text Document Classification
In this paper we develop a complete methodology for document classification and clustering. We start by investigating how the choice of document features influences the performance of a document classifier and then use our findings to develop a clustering method suitable for document collections. From our study of the effect of frequency transformation, term weighting and dimensionality reducti...
Designing a Model for Teacher Competencies in Elementary Education
Purpose: Teacher competencies in the education system are among the most influential and important issues. This importance is rooted in the critical role of teachers in educating people in a society: the more prepared and qualified teachers are, the greater their impact on upgrading the education system. Methodology: In this regard, upstream documents, as the most extensive strategic and...
Information Retrieval: A Survey
Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured, e.g., a sentence or even another document, or which may be structured, e.g., a boolean expression. The need for effective methods of automated IR has grown in importance because of the tremendous e...
Publication date: 2017